Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.
translated by 谷歌翻译
Reranking methods in machine translation aim to close the gap between common evaluation metrics (e.g. BLEU) and maximum likelihood learning and decoding algorithms. Prior works address this challenge by training models to rerank beam search candidates according to their predicted BLEU scores, building upon large models pretrained on massive monolingual corpora -- a privilege that was never made available to the baseline translation model. In this work, we examine a simple approach for training rerankers to predict translation candidates' BLEU scores without introducing additional data or parameters. Our approach can be used as a clean baseline, decoupled from external factors, for future research in this area.
translated by 谷歌翻译
Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) are primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with less than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low and high resource language pairs effectively, and can lead to superior performance overall.
translated by 谷歌翻译
As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.
translated by 谷歌翻译
语言模型既展示了定量的改进,又展示了新的定性功能,随着规模的增加。尽管它们具有潜在的变革性影响,但这些新能力的特征却很差。为了为未来的研究提供信息,为破坏性的新模型能力做准备,并改善社会有害的效果,至关重要的是,我们必须了解目前和近乎未来的能力和语言模型的局限性。为了应对这一挑战,我们介绍了超越模仿游戏基准(Big Bench)。 Big Bench目前由204个任务组成,由132家机构的442位作者贡献。任务主题是多样的,从语言学,儿童发展,数学,常识性推理,生物学,物理学,社会偏见,软件开发等等。 Big-Bench专注于被认为超出当前语言模型的功能的任务。我们评估了OpenAI的GPT型号,Google内部密集变压器体系结构和大型基础上的开关稀疏变压器的行为,跨越了数百万到数十亿个参数。此外,一个人类专家评估者团队执行了所有任务,以提供强大的基准。研究结果包括:模型性能和校准都随规模改善,但绝对的术语(以及与评估者的性能相比);在模型类中的性能非常相似,尽管带有稀疏性。逐渐和预测的任务通常涉及大量知识或记忆成分,而在临界规模上表现出“突破性”行为的任务通常涉及多个步骤或组成部分或脆性指标;社交偏见通常会随着含糊不清的环境而随着规模而增加,但这可以通过提示来改善。
translated by 谷歌翻译
Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
translated by 谷歌翻译
NLP基准在很大程度上主要集中在短篇文本上,例如句子和段落,即使长文本在野外占相当数量的自然语言。我们介绍卷轴,这是一套需要在长文本上推理的任务套件。我们检查现有的长文本数据集,文本自然是长期的,同时优先考虑涉及在输入上扫描信息的任务。滚动包含概述,问题应答和自然语言推理任务,包括多个域,包括文学,科学,业务和娱乐。初始基线(包括啰覆编码器),表明滚动有充足的改进空间。我们以统一的文本到文本格式提供所有数据集,并托管Live Refordboard,以促进模型架构和预用方法的研究。
translated by 谷歌翻译
对于开放式域问题的密集检索已被证明通过在问题通道对的大型数据集上培训来实现令人印象深刻的性能。我们调查是否可以以自我监督的方式学习密集的检索,并有效地应用没有任何注释。我们观察到这种情况下的检索斗争的现有借用模型,并提出了一种设计用于检索的新预制方案:重复跨度检索。我们在文档中使用经常性跨度来创建用于对比学习的伪示例。由此产生的模型 - 蜘蛛 - 在广泛的ODQA数据集上没有任何示例,并且与BM25具有竞争力,具有强烈的稀疏基线。此外,蜘蛛通常优于DPR在其他数据集的问题上培训的DPR培训的强大基线。我们将蜘蛛与BM25结合的混合猎犬改进了所有数据集的组件,并且通常与域中DPR模型具有竞争力,这些模型培训数万例培训。
translated by 谷歌翻译
许多NLP任务需要处理超出预磨模模型的长度限制的长语境。为了将这些模型扩展到更长的文本序列,已经提出了许多有效的远程注意力变体。尽管沿着这个方向进行了丰富的研究,但仍然难以在实际用例中衡量这些模型的相对有效性,例如,如果我们在预先rain-yfetune范式之后应用这些模型。在这项工作中,我们的目标是对这些具有大规模和受控实验的这些新兴模型进行彻底的分析。对于每个关注变体,我们使用相同的长DOC语料库,然后使用相同的长DOC语料库,然后为现实世界的长情节任务进行芬特这些模型。我们的调查结果揭示了现有广泛使用的远程基准的陷阱,并显示任何经过测试的高效关注可以在标准预介质范式下击败一个简单的本地窗口关注。对本地注意力变化的进一步分析表明,即使是常用的注意力窗口重叠也没有必要实现良好的下游结果 - 使用不相交的本地关注,我们能够构建符合性能的更简单且更高效的Long-Doc QA模型霍尔福勒〜\ citep {longformer}其预先花费的一半。
translated by 谷歌翻译
标准预审进的语言模型可在子字代币序列上运行,而无需直接访问组成每个令牌字符串表示的字符。我们探究了预审前的语言模型的嵌入层,并表明模型在一个令人惊讶的程度上学习了整个单词和子字代币的内部字符组成,而没有看到字符和令牌。我们的结果表明,罗伯塔(Roberta)的嵌入层具有足够的信息,可以准确地阐明词汇的三分之一,并在所有令牌类型上达到高平均角色Ngram重叠。我们进一步测试了使用其他字符信息丰富子词模型是否可以改善语言建模,并观察到该方法具有几乎相同的学习曲线,作为训练而无需基于拼写的丰富。总体而言,我们的结果表明,语言建模目标激励模型隐式学习一些拼写概念,并且明确教授模型如何拼写的方式似乎并没有增强其在此类任务上的绩效。
translated by 谷歌翻译